
[TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python#10273

Merged
QiJune merged 33 commits into NVIDIA:main from lancelly:unified_python_scheduler
Jan 20, 2026
Conversation

@lancelly (Collaborator)

@lancelly lancelly commented Dec 24, 2025

This PR is the first step in refactoring the scheduler.
Goal:

  • Re-implement the existing C++ MicroBatchScheduler and CapacityScheduler logic in Python 1:1, without architectural changes.

Deliverables:

  • PyMicroBatchScheduler & PyCapacityScheduler classes.
  • Integration into PyExecutor behind the TLLM_USE_PYTHON_SCHEDULER feature flag.
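A minimal sketch of how such a flag gate might look. Only the flag name (TLLM_USE_PYTHON_SCHEDULER) and SimpleUnifiedScheduler come from this PR; the create_scheduler function and CppSchedulerAdapter are invented stand-ins for illustration:

```python
import os


class SimpleUnifiedScheduler:
    """Stand-in for the Python scheduler introduced by this PR."""


class CppSchedulerAdapter:
    """Invented stand-in for the existing C++ scheduler path."""


def create_scheduler():
    # The PR gates the new Python path behind the TLLM_USE_PYTHON_SCHEDULER
    # environment variable; the rest of this wiring is illustrative only.
    if os.environ.get("TLLM_USE_PYTHON_SCHEDULER") == "1":
        return SimpleUnifiedScheduler()
    return CppSchedulerAdapter()
```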

The overhead of the Python scheduler appears acceptable even in scenarios where host overhead is the bottleneck. The benchmark results are:

  • E2E benchmarks of GPT-OSS-120B:
    • Config: max throughput for GB200 + GPT-OSS-120B + Agg
    • Server: ADP2 + max_batch_size 1536
    • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
    • Result:
      • Output token throughput shows a gap of around 1.3% after 50 runs.
      • Schedule time accounts for approximately 4.4% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 45.11 ms with the C++ scheduler and 46.02 ms with the Python scheduler, an increase of 2.0%.
  • E2E benchmarks of Llama-3.2-1B:
    • Config: GB200 + Llama-3.2-1B + Agg
    • Server: ADP2 + max_batch_size 1536
    • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
    • Result:
      • Output token throughput shows a gap of around 1.4% after 50 runs.
      • Schedule time accounts for approximately 4.45% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 44.53 ms with the C++ scheduler and 45.31 ms with the Python scheduler, an increase of 1.75%.

Details can be found in: Unified Python SPMD Scheduler Execution Plan & Performance Strategy
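The reported overhead percentages are internally consistent with the per-iteration times; a quick check (helper name invented):

```python
def pct_increase(base_ms, new_ms):
    """Relative increase of new_ms over base_ms, in percent."""
    return (new_ms - base_ms) / base_ms * 100


# host_step_time: C++ scheduler vs. Python scheduler, from the PR description
gpt_oss = pct_increase(45.11, 46.02)  # GPT-OSS-120B, reported as ~2.0%
llama = pct_increase(44.53, 45.31)    # Llama-3.2-1B, reported as ~1.75%
```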

QiJune and others added 22 commits December 17, 2025 13:37
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
@lancelly lancelly requested review from a team as code owners December 24, 2025 09:20
@lancelly lancelly requested a review from HuiGao-NV December 24, 2025 09:20
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@lancelly lancelly requested review from QiJune and litaotju December 24, 2025 09:22
@tensorrt-cicd (Collaborator)

PR_Github #29796 [ run ] triggered by Bot. Commit: 411c254

@coderabbitai (Contributor)

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

These changes extend Python bindings for GenLlmReq and KVCacheManager C++ classes, add an environment variable to enable Python-based scheduling, and introduce a comprehensive Python scheduling framework with capacity and micro-batch scheduling policies as an alternative to C++ scheduler components.

Changes

  • C++ Pybind/Nanobind Bindings (cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp, cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp): Expose new GenLlmReq Python methods: get_unique_tokens(beam) and get_unique_tokens() overloads, plus get_encoder_unique_tokens() returning optional VecUniqueTokens. Adjust the binding chain on use_draft_model to enable additional chained bindings.
  • KV Cache Manager Bindings (cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp): Add Python bindings for find_new_context_block(unique_tokens, llm_request) on BaseKVCacheManager and scheduling_has_free_blocks(num_required, window_size) on KVCacheManager, delegating to the underlying C++ implementations.
  • Scheduler Initialization & Configuration (tensorrt_llm/__init__.py, tensorrt_llm/_torch/pyexecutor/_util.py): Set the TLLM_USE_PYTHON_SCHEDULER=1 environment variable on startup. Add conditional logic in create_py_executor_instance to select SimpleUnifiedScheduler when the flag is enabled; otherwise retain the existing C++ scheduler selection logic.
  • Python Scheduling Framework (tensorrt_llm/_torch/pyexecutor/scheduler.py): Introduce a comprehensive Python-based scheduling system: PyCapacityScheduler (orchestrator with policy-based fitting), PyMicroBatchScheduler (encoder/context/generation batching), and SimpleUnifiedScheduler (composite runner). Add SchedulerPolicyBase with MaxRequestsPolicy, GuaranteedNoEvictPolicy, and MaxUtilizationPolicy implementations; block-tracking managers; a ChunkingPolicy enum; and state/prioritization logic mirroring the C++ behavior.
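The policy hierarchy described above can be sketched minimally. The class names SchedulerPolicyBase and MaxRequestsPolicy come from the change summary; the select method, its signature, and its behavior are assumptions made for illustration:

```python
from abc import ABC, abstractmethod


class SchedulerPolicyBase(ABC):
    """Interface each capacity policy implements (method name illustrative)."""

    @abstractmethod
    def select(self, requests, max_num_requests):
        """Return the subset of requests admitted this iteration."""


class MaxRequestsPolicy(SchedulerPolicyBase):
    """Toy policy: admit requests in arrival order up to a fixed count."""

    def select(self, requests, max_num_requests):
        return requests[:max_num_requests]
```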

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Executor as Executor
    participant Sched as SimpleUnifiedScheduler
    participant Capacity as PyCapacityScheduler
    participant MicroBatch as PyMicroBatchScheduler
    participant KVCache as KVCacheManager
    participant Policy as SchedulerPolicy

    Executor->>Sched: schedule(pending_requests, running_requests, kv_cache_manager)
    activate Sched
    
    Sched->>Capacity: schedule(pending, running, kv_cache_manager)
    activate Capacity
    
    Capacity->>Policy: get_new_request_ids(pending)
    activate Policy
    Policy->>Capacity: filtered_request_ids
    deactivate Policy
    
    loop For each candidate request
        Capacity->>KVCache: find_new_context_block(unique_tokens, request)
        KVCache->>Capacity: context_block_info
        Capacity->>Capacity: fit_request_to_blocks()
    end
    
    Capacity->>Sched: scheduled_requests, paused_requests
    deactivate Capacity
    
    Sched->>MicroBatch: schedule(scheduled_requests, kv_cache_manager)
    activate MicroBatch
    
    MicroBatch->>MicroBatch: compute_chunk_sizes()
    
    rect rgb(200, 220, 255)
        note right of MicroBatch: Encoder phase
        MicroBatch->>KVCache: scheduling_has_free_blocks()
        KVCache->>MicroBatch: has_free
    end
    
    rect rgb(220, 240, 220)
        note right of MicroBatch: Context phase
        MicroBatch->>MicroBatch: select_requests_for_context()
    end
    
    rect rgb(255, 240, 200)
        note right of MicroBatch: Generation phase
        MicroBatch->>MicroBatch: select_requests_for_generation()
    end
    
    MicroBatch->>Sched: SchedulerOutput (batches, tokens)
    deactivate MicroBatch
    
    Sched->>Executor: SchedulerOutput
    deactivate Sched
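The flow in the sequence diagram can be condensed into a runnable toy. All classes here are simplified stand-ins invented for illustration; only the scheduling_has_free_blocks method name and the capacity-then-micro-batch split come from the PR:

```python
class FakeKVCacheManager:
    """Toy KV-cache manager; only scheduling_has_free_blocks matches the PR's binding."""

    def __init__(self, free_blocks, tokens_per_block=64):
        self.free_blocks = free_blocks
        self.tokens_per_block = tokens_per_block

    def scheduling_has_free_blocks(self, num_required, window_size=None):
        return self.free_blocks >= num_required


class ToyCapacityScheduler:
    """Admits requests while their KV blocks still fit (guaranteed-no-evict flavor)."""

    def schedule(self, pending_lengths, kv):
        fitted, paused, used = [], [], 0
        for prompt_len in pending_lengths:
            needed = -(-prompt_len // kv.tokens_per_block)  # ceil division
            if kv.scheduling_has_free_blocks(used + needed):
                used += needed
                fitted.append(prompt_len)
            else:
                paused.append(prompt_len)
        return fitted, paused


class ToyMicroBatchScheduler:
    """Splits fitted requests into micro-batches of a fixed size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size

    def schedule(self, fitted):
        return [fitted[i:i + self.batch_size]
                for i in range(0, len(fitted), self.batch_size)]


class ToyUnifiedScheduler:
    """Composite runner in the spirit of SimpleUnifiedScheduler."""

    def __init__(self, capacity, micro_batch):
        self.capacity = capacity
        self.micro_batch = micro_batch

    def schedule(self, pending_lengths, kv):
        fitted, paused = self.capacity.schedule(pending_lengths, kv)
        return self.micro_batch.schedule(fitted), paused
```

With 4 free blocks and three 100-token requests (2 blocks each), the first two fit and the third is paused, matching the fit-or-pause behavior the diagram describes.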

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 52.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check — ❓ Inconclusive: The PR description provides technical context and benchmark results but lacks key template sections such as the PR title format, a clear problem statement, test coverage details, and the completion checklist. Add a properly formatted title following the [JIRA/ticket][type] format, clearly state the problem/solution, explicitly list test cases, and complete the PR checklist items.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: re-implementing MicroBatchScheduler and CapacityScheduler in Python, with the JIRA ticket properly referenced.


@tensorrt-cicd (Collaborator)

PR_Github #31953 [ run ] completed with state SUCCESS. Commit: 82fac4d
/LLM/main/L0_MergeRequest_PR pipeline #24754 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #31985 [ run ] triggered by Bot. Commit: 82fac4d

@tensorrt-cicd (Collaborator)

PR_Github #31985 [ run ] completed with state SUCCESS. Commit: 82fac4d
/LLM/main/L0_MergeRequest_PR pipeline #24778 completed with status: 'SUCCESS'

@Funatiq (Collaborator) left a comment


Great work. I haven't looked at the implementation in detail yet. Two suggestions:

  1. I think we should add a simple test to CI so that we don't break the functionality by accident. E.g. run test_overlap_scheduler.py with the Python scheduler too.
  2. Can we run a performance check on a smaller model like Llama-3.2-1B? The overhead should be more significant there. Ideally we should add NVTX ranges and collect nsys profiles to isolate the differences in execution time for the scheduler.

@lancelly (Collaborator, Author)

Great work. I haven't looked at the implementation in detail yet. Two suggestions:

  1. I think we should add a simple test to CI so that we don't break the functionality by accident. E.g. run test_overlap_scheduler.py with the Python scheduler too.
  2. Can we run a performance check on a smaller model like Llama-3.2-1B? The overhead should be more significant there. Ideally we should add NVTX ranges and collect nsys profiles to isolate the differences in execution time for the scheduler.

Thanks for the review! @Funatiq

  1. Will do.
  2. I'll run a performance check on Llama-3.2-1B. The time breakdown listed above was done by adding timer logs and also verified with nsys profiles.
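The "timer logs" approach mentioned above can be as simple as a context manager that records wall-clock durations per labeled range (all names here are invented for illustration, not the PR's actual instrumentation):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of measured durations in milliseconds
timings = defaultdict(list)


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append((time.perf_counter() - start) * 1000.0)
```

Wrapping the scheduler call in `with timed("_schedule"): ...` and averaging `timings["_schedule"]` afterward gives the kind of per-iteration breakdown quoted in the benchmarks, which can then be cross-checked against nsys NVTX ranges.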

@Funatiq (Collaborator)

Funatiq commented Jan 15, 2026

Since you already have nsys profiles, could you report what the runtime for only the _schedule range is in both cases?

@lancelly (Collaborator, Author)

lancelly commented Jan 16, 2026

Since you already have nsys profiles, could you report what the runtime for only the _schedule range is in both cases?

Sure, this image shows the result mentioned above (nsys reports different schedule times for each iteration, so we only recorded the averages/medians). Details are in: https://docs.google.com/document/d/1he4S6hzDBApMGp2Bl5PTED-hcKaRmXDZbOi9EgJlK5A/edit?tab=t.0
[Screenshot: nsys schedule-time comparison]

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32429 [ run ] triggered by Bot. Commit: baeee83

@NVIDIA deleted two comments from tensorrt-cicd Jan 18, 2026
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32433 [ run ] triggered by Bot. Commit: baeee83

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32442 [ run ] triggered by Bot. Commit: fc794cb

@tensorrt-cicd (Collaborator)

PR_Github #32442 [ run ] completed with state SUCCESS. Commit: fc794cb
/LLM/main/L0_MergeRequest_PR pipeline #25131 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@lancelly (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #32456 [ run ] triggered by Bot. Commit: fc794cb

@lancelly (Collaborator, Author)

@Funatiq Hi, I added an E2E benchmark on Llama-3.2-1B as suggested. The host overhead also seems acceptable on Llama-3.2-1B:

  • Config: GB200 + Llama-3.2-1B + Agg
  • Server: ADP2 + max_batch_size 1536
  • Client: max-concurrency 3072 + isl_1k_osl_1k + num_prompts 30720
  • Result
    • Output token throughput shows a gap of around 1.4% after 50 runs.
    • Schedule time accounts for approximately 4.45% of host_step_time when using the C++ scheduler. The average per-iteration host_step_time is 44.53 ms with the C++ scheduler and 45.31 ms with the Python scheduler, an increase of 1.75%.
[Screenshots: Llama-3.2-1B throughput and schedule-time results]

@tensorrt-cicd (Collaborator)

PR_Github #32456 [ run ] completed with state SUCCESS. Commit: fc794cb
/LLM/main/L0_MergeRequest_PR pipeline #25143 completed with status: 'SUCCESS'

@Funatiq (Collaborator)

Funatiq commented Jan 19, 2026

Thanks for the benchmarks. Could you add a short summary to the PR description please?

@lancelly (Collaborator, Author)

Thanks for the benchmarks. Could you add a short summary to the PR description please?

Sure, I have updated the PR description. I think this PR can be merged.

@lancelly (Collaborator, Author)

lancelly commented Jan 19, 2026

@eopXD @nvpohanh Hi, could you please review/approve this PR? Thanks!

@QiJune QiJune merged commit dbb858a into NVIDIA:main Jan 20, 2026
9 checks passed
